Purpose of this quarto-html
- To explore the dataset from Zhang et al.
- To try to determine if we can find a link between weather and train delays.
Let's begin by looking at some very basic features of the dataset:
Dimensions
[1] 2751713 16
Column names
[1] "date" "train_number"
[3] "train_direction" "station_name"
[5] "station_order" "scheduled_arrival_time"
[7] "scheduled_departure_time" "stop_time"
[9] "actual_arrival_time" "actual_departure_time"
[11] "arrival_delay" "departure_delay"
[13] "wind" "weather"
[15] "temperature" "major_holiday"
The dataset appears well structured but is large: about 2.75 million rows across 16 columns. The column names are descriptive and cover scheduled times, actual times, delays, and weather conditions.
Overview of the data - Summary statistics
To get a quick glimpse of the data we can have a look at some summary statistics.
| station_name | Mean_arriv | Mean_depar | stdev_arriv | stdev_delay | n | unique_arriv | unique_dep | Mean_temp |
|---|---|---|---|---|---|---|---|---|
| Jianwei Railway Station | 532.0000 | 532.0000 | 0.00000 | 0.00000 | 29 | 1 | 1 | 10.793103 |
| Yuzhou Railway Station | 531.4444 | 531.4444 | 75.67617 | 75.67617 | 36 | 3 | 3 | 5.805556 |
| Guanyun Railway Station | 500.0132 | 500.0132 | 178.21474 | 178.21474 | 151 | 9 | 9 | 6.622517 |
| Fangcheng Railway Station | 485.4444 | 485.4444 | 75.67617 | 75.67617 | 36 | 3 | 3 | 6.055556 |
| Jieshounan Railway Station | 465.2176 | 465.2176 | 359.13111 | 359.13111 | 239 | 18 | 18 | 5.941423 |
| Xingandong Railway Station | 416.3605 | 416.3605 | 212.42152 | 212.42152 | 147 | 10 | 10 | 11.476190 |
| train_number | Mean_arriv | Mean_depar | stdev_arriv | stdev_delay | n | unique_arriv | unique_dep | Mean_temp |
|---|---|---|---|---|---|---|---|---|
| G4027 | 853.1429 | 696.7143 | 382.2505 | 542.9797 | 7 | 7 | 7 | 28.85714 |
| G4919 | 840.0000 | 422.6667 | 653.4977 | 701.7814 | 6 | 3 | 3 | 23.66667 |
| G4950 | 826.5000 | 642.6667 | 410.2257 | 578.8743 | 6 | 6 | 6 | 22.66667 |
| G9252 | 811.0000 | 722.3077 | 253.9600 | 418.3552 | 13 | 13 | 13 | 19.92308 |
| G4923 | 801.0000 | 531.0000 | 534.6631 | 640.0818 | 4 | 4 | 4 | 24.75000 |
| G4966 | 661.2500 | 447.7500 | 411.7026 | 502.2102 | 8 | 4 | 4 | 22.00000 |
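A grouped summary like the two tables above can be sketched in a few lines. This is only an illustration, not the code behind this report: it assumes that `Mean_arriv` and `Mean_depar` are the means of `arrival_delay` and `departure_delay`, the helper `summarise` is hypothetical, and the rows are made up. Only the column names are taken from the dataset.

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up illustrative rows; only the column names come from the dataset.
rows = [
    {"station_name": "Station A", "arrival_delay": 2.0, "departure_delay": 3.0, "temperature": 10.0},
    {"station_name": "Station A", "arrival_delay": 4.0, "departure_delay": 3.0, "temperature": 12.0},
    {"station_name": "Station B", "arrival_delay": 0.0, "departure_delay": 1.0, "temperature": 5.0},
    {"station_name": "Station B", "arrival_delay": 6.0, "departure_delay": 5.0, "temperature": 7.0},
    {"station_name": "Station B", "arrival_delay": 3.0, "departure_delay": 3.0, "temperature": 6.0},
]


def summarise(rows, key):
    """Group rows by `key` and compute per-group summary statistics."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    summary = {}
    for name, members in groups.items():
        arriv = [r["arrival_delay"] for r in members]
        depar = [r["departure_delay"] for r in members]
        temps = [r["temperature"] for r in members]
        summary[name] = {
            "Mean_arriv": mean(arriv),
            "Mean_depar": mean(depar),
            # Simplification: report 0.0 when a group has a single row,
            # because the sample standard deviation is undefined there.
            "stdev_arriv": stdev(arriv) if len(arriv) > 1 else 0.0,
            "n": len(members),
            "unique_arriv": len(set(arriv)),
            "unique_dep": len(set(depar)),
            "Mean_temp": mean(temps),
        }
    return summary


station_summary = summarise(rows, "station_name")
```

Passing `"train_number"` as the key (with rows that carry that column) would give the per-train version of the same summary.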
Summary stats can also be plotted
Looking at the data (and not the summary statistics)
Summary statistics can be informative and help us understand data, but they can also obfuscate problems in a dataset.
Plotting individual datapoints can help when exploring a new set of data. So let’s look at the departures from a few stations:
Subsetting the data
To run a cursory analysis of our question, we work with a subset of the data.
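How the subset is chosen is not spelled out here, so the following is only a sketch of one possible approach: keep the rows for a handful of stations. The helper `subset_by_station` and the example rows are hypothetical.

```python
# Made-up illustrative rows; only the column names come from the dataset.
rows = [
    {"station_name": "Station A", "departure_delay": 3.0},
    {"station_name": "Station B", "departure_delay": 1.0},
    {"station_name": "Station C", "departure_delay": 7.0},
    {"station_name": "Station A", "departure_delay": 0.0},
]


def subset_by_station(rows, stations):
    """Keep only the rows whose station_name is in `stations`."""
    wanted = set(stations)  # set membership keeps the filter fast on large data
    return [row for row in rows if row["station_name"] in wanted]


subset = subset_by_station(rows, ["Station A", "Station C"])
```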
What does the subset look like?
With a smaller dataset we can more easily plot out individual datapoints.
Interactivity can be a very helpful tool when trying to understand visuals and outputs. It can be a real time-saver, especially for larger datasets.
After outlier removal
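The outlier rule used for this step is not stated, so as an assumption the sketch below uses the common 1.5 × IQR rule (Tukey's fences). The helper `remove_outliers_iqr` and the delay values are made up for illustration.

```python
from statistics import quantiles


def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Made-up departure delays with one extreme value.
delays = [1, 2, 2, 3, 3, 4, 4, 5, 120]
cleaned = remove_outliers_iqr(delays)
```

Other rules (a fixed cut-off, or a z-score threshold) would be equally reasonable choices; which one is appropriate depends on what counts as a genuine long delay versus a data error.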
Basic analysis of relation between weather and departure delays
Having found a reasonable subset, we want to see whether it can help answer our question. So we plot temperature against departure delay and fit a curve to the data.
There is a small association: lower temperatures seem to go with greater delays. The effect is statistically significant but very small.
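To make the idea of fitting concrete, here is a minimal sketch that fits a straight line by ordinary least squares and reports the correlation. The type of curve fitted in this report is not stated, so the straight line is an assumption; the helper `fit_line` and the data points are made up and are not results from the dataset.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Made-up example: temperature and departure delay.
temperature = [0, 5, 10, 15, 20]
departure_delay = [6.0, 3.0, 5.0, 2.0, 4.0]

slope, intercept, r = fit_line(temperature, departure_delay)
r_squared = r ** 2  # share of the variance in delay explained by the line
```

A negative slope corresponds to "lower temperature, greater delay". With a very large number of rows, even a tiny slope can come out statistically significant, which is why it helps to look at the size of the slope and at `r_squared` as well as the p-value.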